In this case study, I work for a fictional company, Cyclistic, and answer key business questions by following the steps of the data analysis process: ask, prepare, process, analyze, share and act.
Specifically, I am on the marketing analyst team at Cyclistic, a bike-sharing company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. My team wants to understand how casual riders and annual members use Cyclistic bikes differently, so that it can design a new marketing strategy to convert casual riders into annual members.
Step 3 - Process Data
- Documentation of any cleaning or manipulation of data
Using the programming language R, I will process, filter, and analyze the data. R is a powerful tool for data analysis because it is flexible, reproducible, and optimized for cleaning and visualizing large data.
knitr::opts_chunk$set(dpi = 300)
# install packages and load libraries
if(!require(tidyverse)) install.packages("tidyverse")
## Loading required package: tidyverse
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.7
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.2 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
if(!require(rio)) install.packages("rio")
## Loading required package: rio
if(!require(skimr))install.packages("skimr")
## Loading required package: skimr
if(!require(rmarkdown))install.packages("rmarkdown")
## Loading required package: rmarkdown
if(!require(doParallel))install.packages("doParallel")
## Loading required package: doParallel
## Loading required package: foreach
##
## Attaching package: 'foreach'
## The following objects are masked from 'package:purrr':
##
## accumulate, when
## Loading required package: iterators
## Loading required package: parallel
if(!require(lubridate))install.packages("lubridate")
## Loading required package: lubridate
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
if(!require(patchwork))install.packages("patchwork")
## Loading required package: patchwork
library(doParallel)
library(skimr)
library(lubridate)
library(rmarkdown)
library(tidyverse)
library(rio)
library(patchwork)
#rio's import() picks a reader based on the file
#extension, so it can import many file types. Here it
#imported started_at and ended_at as date-times
#(POSIXct) rather than as character strings.
# do parallel computing in order to make model and
# plot calculation faster.
c1 <- makePSOCKcluster(5)
registerDoParallel(c1)
options(scipen=999)
We will start with importing the data.
# Import January first so the loop has a data frame to append to.
data <- import("202101-divvy-tripdata.csv")
for (i in 2:12) {
  # "%02d" zero-pads the month number, so 2 becomes "02" and 10 stays "10"
  path <- sprintf("2021%02d-divvy-tripdata.csv", i)
  newdata <- import(path)
  data <- rbind(data, newdata)
}
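As an aside, the same import can be written without growing the data frame inside a loop. The sketch below is only an illustration under stated assumptions: it writes three tiny stand-in CSV files to a temporary folder (the `demo` rows and the `tmp_dir` folder are hypothetical, not the real Divvy files) and uses base R's `read.csv()` instead of `import()` so that it runs on its own.

```r
# Hypothetical stand-in data: three tiny monthly files in a temp folder.
tmp_dir <- tempfile("divvy_demo_")
dir.create(tmp_dir)
for (m in 1:3) {
  demo <- data.frame(ride_id = paste0("ride_", m), month = m)
  write.csv(demo,
            file.path(tmp_dir, sprintf("2021%02d-divvy-tripdata.csv", m)),
            row.names = FALSE)
}

# Build every file name up front; "%02d" zero-pads single-digit months.
paths <- file.path(tmp_dir, sprintf("2021%02d-divvy-tripdata.csv", 1:3))

# Read each file into a list, then bind the list once.
monthly <- lapply(paths, read.csv)
combined <- do.call(rbind, monthly)

nrow(combined)
```

Binding once at the end avoids copying the accumulated data frame on every iteration, which can matter when there are millions of rows.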
Now, let us look through the data.
head(data)
## ride_id rideable_type started_at ended_at
## 1 E19E6F1B8D4C42ED electric_bike 2021-01-23 16:14:19 2021-01-23 16:24:44
## 2 DC88F20C2C55F27F electric_bike 2021-01-27 18:43:08 2021-01-27 18:47:12
## 3 EC45C94683FE3F27 electric_bike 2021-01-21 22:35:54 2021-01-21 22:37:14
## 4 4FA453A75AE377DB electric_bike 2021-01-07 13:31:13 2021-01-07 13:42:55
## 5 BE5E8EB4E7263A0B electric_bike 2021-01-23 02:24:02 2021-01-23 02:24:45
## 6 5D8969F88C773979 electric_bike 2021-01-09 14:24:07 2021-01-09 15:17:54
## start_station_name start_station_id end_station_name end_station_id
## 1 California Ave & Cortez St 17660
## 2 California Ave & Cortez St 17660
## 3 California Ave & Cortez St 17660
## 4 California Ave & Cortez St 17660
## 5 California Ave & Cortez St 17660
## 6 California Ave & Cortez St 17660
## start_lat start_lng end_lat end_lng member_casual
## 1 41.90034 -87.69674 41.89 -87.72 member
## 2 41.90033 -87.69671 41.90 -87.69 member
## 3 41.90031 -87.69664 41.90 -87.70 member
## 4 41.90040 -87.69666 41.92 -87.69 member
## 5 41.90033 -87.69670 41.90 -87.70 casual
## 6 41.90041 -87.69676 41.94 -87.71 casual
str(data)
## 'data.frame': 5595063 obs. of 13 variables:
## $ ride_id : chr "E19E6F1B8D4C42ED" "DC88F20C2C55F27F" "EC45C94683FE3F27" "4FA453A75AE377DB" ...
## $ rideable_type : chr "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : POSIXct, format: "2021-01-23 16:14:19" "2021-01-27 18:43:08" ...
## $ ended_at : POSIXct, format: "2021-01-23 16:24:44" "2021-01-27 18:47:12" ...
## $ start_station_name: chr "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" ...
## $ start_station_id : chr "17660" "17660" "17660" "17660" ...
## $ end_station_name : chr "" "" "" "" ...
## $ end_station_id : chr "" "" "" "" ...
## $ start_lat : num 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num -87.7 -87.7 -87.7 -87.7 -87.7 ...
## $ end_lat : num 41.9 41.9 41.9 41.9 41.9 ...
## $ end_lng : num -87.7 -87.7 -87.7 -87.7 -87.7 ...
## $ member_casual : chr "member" "member" "member" "member" ...
glimpse(data)
## Rows: 5,595,063
## Columns: 13
## $ ride_id <chr> "E19E6F1B8D4C42ED", "DC88F20C2C55F27F", "EC45C94683~
## $ rideable_type <chr> "electric_bike", "electric_bike", "electric_bike", ~
## $ started_at <dttm> 2021-01-23 16:14:19, 2021-01-27 18:43:08, 2021-01-~
## $ ended_at <dttm> 2021-01-23 16:24:44, 2021-01-27 18:47:12, 2021-01-~
## $ start_station_name <chr> "California Ave & Cortez St", "California Ave & Cor~
## $ start_station_id <chr> "17660", "17660", "17660", "17660", "17660", "17660~
## $ end_station_name <chr> "", "", "", "", "", "", "", "", "", "Wood St & Augu~
## $ end_station_id <chr> "", "", "", "", "", "", "", "", "", "657", "13258",~
## $ start_lat <dbl> 41.90034, 41.90033, 41.90031, 41.90040, 41.90033, 4~
## $ start_lng <dbl> -87.69674, -87.69671, -87.69664, -87.69666, -87.696~
## $ end_lat <dbl> 41.89000, 41.90000, 41.90000, 41.92000, 41.90000, 4~
## $ end_lng <dbl> -87.72000, -87.69000, -87.70000, -87.69000, -87.700~
## $ member_casual <chr> "member", "member", "member", "member", "casual", "~
summary(data)
## ride_id rideable_type started_at
## Length:5595063 Length:5595063 Min. :2021-01-01 00:02:05
## Class :character Class :character 1st Qu.:2021-06-06 23:52:40
## Mode :character Mode :character Median :2021-08-01 01:52:11
## Mean :2021-07-29 07:41:02
## 3rd Qu.:2021-09-24 16:36:16
## Max. :2021-12-31 23:59:48
##
## ended_at start_station_name start_station_id
## Min. :2021-01-01 00:08:39 Length:5595063 Length:5595063
## 1st Qu.:2021-06-07 00:44:21 Class :character Class :character
## Median :2021-08-01 02:21:55 Mode :character Mode :character
## Mean :2021-07-29 08:02:58
## 3rd Qu.:2021-09-24 16:54:05
## Max. :2022-01-03 17:32:18
##
## end_station_name end_station_id start_lat start_lng
## Length:5595063 Length:5595063 Min. :41.64 Min. :-87.84
## Class :character Class :character 1st Qu.:41.88 1st Qu.:-87.66
## Mode :character Mode :character Median :41.90 Median :-87.64
## Mean :41.90 Mean :-87.65
## 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :42.07 Max. :-87.52
##
## end_lat end_lng member_casual
## Min. :41.39 Min. :-88.97 Length:5595063
## 1st Qu.:41.88 1st Qu.:-87.66 Class :character
## Median :41.90 Median :-87.64 Mode :character
## Mean :41.90 Mean :-87.65
## 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :42.17 Max. :-87.49
## NA's :4771 NA's :4771
skim(data)
Data summary
| | |
|:---|:---|
| Name | data |
| Number of rows | 5595063 |
| Number of columns | 13 |
| Column type frequency: | |
| character | 7 |
| numeric | 4 |
| POSIXct | 2 |
| Group variables | None |

Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|:---|---:|---:|---:|---:|---:|---:|---:|
| ride_id | 0 | 1 | 16 | 16 | 0 | 5595063 | 0 |
| rideable_type | 0 | 1 | 11 | 13 | 0 | 3 | 0 |
| start_station_name | 0 | 1 | 0 | 53 | 690809 | 848 | 0 |
| start_station_id | 0 | 1 | 0 | 36 | 690806 | 835 | 0 |
| end_station_name | 0 | 1 | 0 | 53 | 739170 | 845 | 0 |
| end_station_id | 0 | 1 | 0 | 36 | 739170 | 833 | 0 |
| member_casual | 0 | 1 | 6 | 6 | 0 | 2 | 0 |

Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|:---|---:|---:|---:|---:|---:|---:|---:|---:|---:|:---|
| start_lat | 0 | 1 | 41.90 | 0.05 | 41.64 | 41.88 | 41.90 | 41.93 | 42.07 | ▁▁▇▇▁ |
| start_lng | 0 | 1 | -87.65 | 0.03 | -87.84 | -87.66 | -87.64 | -87.63 | -87.52 | ▁▁▆▇▁ |
| end_lat | 4771 | 1 | 41.90 | 0.05 | 41.39 | 41.88 | 41.90 | 41.93 | 42.17 | ▁▁▁▇▁ |
| end_lng | 4771 | 1 | -87.65 | 0.03 | -88.97 | -87.66 | -87.64 | -87.63 | -87.49 | ▁▁▁▁▇ |

Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|:---|---:|---:|:---|:---|:---|---:|
| started_at | 0 | 1 | 2021-01-01 00:02:05 | 2021-12-31 23:59:48 | 2021-08-01 01:52:11 | 4677998 |
| ended_at | 0 | 1 | 2021-01-01 00:08:39 | 2022-01-03 17:32:18 | 2021-08-01 02:21:55 | 4671372 |
This is a large data set, with about 5.6 million observations (5,595,063 rows) over 13 columns. Now let us process the data. There are some NAs (4,771 rows are missing end_lat and end_lng), so let’s remove the rows that contain them and remove duplicate entries. Let us also remove rows with blank station information, add new columns for ride duration, month, weekday, hour and route, and drop the imported columns that we don’t need, to make our processing faster.
data = na.omit(data)
data = distinct(data)
data = data %>%
filter(start_station_id != "" & end_station_name != "" &
start_station_name != "" & end_station_id != "" ) %>%
mutate(duration = round(difftime(ended_at, started_at, units = "mins"))) %>%
mutate(month = month(started_at, label = TRUE)) %>%
mutate(wday = wday(started_at, label = TRUE)) %>%
mutate(hour = hour(started_at)) %>%
mutate(route = paste(start_station_name, end_station_name, sep = " to ")) %>%
select(-c(start_station_name, end_station_name, ride_id,
start_station_id, end_station_id, start_lat,
start_lng, end_lat, end_lng))
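To illustrate why the blank-string filter is needed in addition to `na.omit()`, here is a minimal sketch on made-up data (the `toy` data frame and its values are hypothetical). It shows that `na.omit()` drops rows, not columns, and that an empty station name is not treated as missing.

```r
# Toy data: row 2 has a true NA, row 3 has a blank station name.
toy <- data.frame(
  start_station_name = c("A St", "B St", "", "D St"),
  end_lat = c(41.90, NA, 41.92, 41.93),
  stringsAsFactors = FALSE
)

# na.omit() removes the row with the NA; both columns are kept.
after_na_omit <- na.omit(toy)

# The blank name survives na.omit(), so it needs its own filter.
after_blank_filter <- after_na_omit[after_na_omit$start_station_name != "", ]

nrow(after_na_omit)       # 3 rows left
ncol(after_na_omit)       # still 2 columns
nrow(after_blank_filter)  # 2 rows left
```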
Some of the columns are character strings, which are less convenient to group and summarize in the rest of the analysis. Let us convert them to factors, remove entries whose rounded trip duration is zero or negative, and sort everything in descending order by duration.
data <- as.data.frame(unclass(data), stringsAsFactors = TRUE)
data = data %>%
filter(duration > 0) %>%
arrange(-duration)
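One consequence of rounding the duration to whole minutes before filtering is worth seeing on a small example. In the sketch below, the first two rides reuse timestamps from the `head(data)` output above, and the third is a hypothetical 20-second ride added for illustration.

```r
# Rides 1 and 2 come from the head(data) output; ride 3 is hypothetical.
started_at <- as.POSIXct(c("2021-01-23 16:14:19", "2021-01-23 02:24:02",
                           "2021-01-23 09:00:00"), tz = "UTC")
ended_at   <- as.POSIXct(c("2021-01-23 16:24:44", "2021-01-23 02:24:45",
                           "2021-01-23 09:00:20"), tz = "UTC")

# Same rounding as in the processing step: whole minutes.
duration <- round(difftime(ended_at, started_at, units = "mins"))
duration_min <- as.numeric(duration)

# Rides that round to 0 minutes fail the duration > 0 test.
kept <- duration_min > 0

duration_min  # 10 1 0
kept          # TRUE TRUE FALSE
```

So the filter drops not only negative durations but also very short rides (under about 30 seconds), which round down to zero.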
Let us group the rides by member_casual (whether the rider is a member or a casual rider) and by weekday, and look at some statistics for each group. (The rideable_type column, used later, distinguishes 3 types of bikes: classic, docked and electric.)
data %>%
group_by(member_casual, wday) %>%
summarize(mean = round(mean(duration)), med = median(duration),
count = n())
## `summarise()` has grouped output by 'member_casual'. You can override using the `.groups` argument.
## # A tibble: 14 x 5
## # Groups: member_casual [2]
## member_casual wday mean med count
## <fct> <ord> <drtn> <drtn> <int>
## 1 casual Sun 38 mins 20 mins 401446
## 2 casual Mon 33 mins 17 mins 227583
## 3 casual Tue 29 mins 15 mins 213663
## 4 casual Wed 28 mins 14 mins 216898
## 5 casual Thu 28 mins 14 mins 222924
## 6 casual Fri 31 mins 16 mins 288407
## 7 casual Sat 35 mins 19 mins 465725
## 8 member Sun 15 mins 11 mins 307771
## 9 member Mon 13 mins 9 mins 342941
## 10 member Tue 13 mins 9 mins 384495
## 11 member Wed 13 mins 9 mins 393914
## 12 member Thu 12 mins 9 mins 369888
## 13 member Fri 13 mins 10 mins 362108
## 14 member Sat 15 mins 11 mins 353215
Casual riders seem to ride about 2x longer than members, and they ride the most on weekends. Members ride ~12-15min on average, while casuals ride ~28-38min.
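In every group of the table the mean is well above the median, which suggests that a small number of very long rides pull the average up. The following sketch uses made-up durations (the `durations_min` vector is hypothetical, not taken from the data) to show the effect of a single long ride.

```r
# Hypothetical ride durations in minutes; the last ride is an outlier.
durations_min <- c(5, 8, 10, 12, 15, 20, 25, 300)

mean(durations_min)    # 49.375, pulled up by the 300-minute ride
median(durations_min)  # 13.5, barely affected by it
```

For skewed data like this, the median is often the safer description of a typical ride.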
Step 4 & 5 - Analyze and Visualize Data
These two steps will be done at the same time.
Let us start making some visuals for the data.
#Percentage of casual and member rides over the last year. Members
#have taken more rides than casuals. Let's break this down further.
#1. Members took more rides (55%) than casuals (45%).
data %>%
group_by(member_casual)%>%
summarize(count = n()) %>%
mutate(percent = round(count*100/sum(count),0)) %>%
ggplot(aes(x = "", y = count, fill = member_casual)) +
geom_bar(width = 1, stat = "identity", color = "white", show.legend = FALSE) +
coord_polar("y", start = 0) +
geom_text(aes(label =
paste(member_casual,
paste(percent, "%"),
sep = "\n")),
position = position_stack(vjust = 0.6),
color = "white") +
labs(title = "Percentage of Casual and Member Rides Over the Last Year")+
theme_void()
- Members (55%) took more rides than casuals (45%). Let’s break this data down further by bike type.
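As a cross-check, the 55% / 45% split can be recomputed from the weekday counts in the summary table above. This is just arithmetic on the printed numbers; the vector names `casual` and `member` are chosen here for illustration.

```r
# Ride counts per weekday, copied from the summary table above.
casual <- c(Sun = 401446, Mon = 227583, Tue = 213663, Wed = 216898,
            Thu = 222924, Fri = 288407, Sat = 465725)
member <- c(Sun = 307771, Mon = 342941, Tue = 384495, Wed = 393914,
            Thu = 369888, Fri = 362108, Sat = 353215)

totals <- c(casual = sum(casual), member = sum(member))
share <- round(100 * totals / sum(totals))

totals  # casual 2036646, member 2514332
share   # casual 45, member 55
```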
#Breaking down the data further by bike type. For some reason, no
#docked bikes were used by members.
#2. The most popular bike type for both members and casuals was the classic bike.
data %>%
group_by(member_casual, rideable_type)%>%
summarize(count = n()) %>%
mutate(percent = round(count*100/sum(count),0)) %>%
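# explicit group (bike type within rider type) so that bars and labels stack in the same order
ggplot(aes(x = "", y = count, fill = member_casual,
group = interaction(rideable_type, member_casual))) +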
geom_bar(width = 1, stat = "identity", color = "white", show.legend = TRUE) +
coord_polar("y", start = 0) +
geom_text(aes(label =
paste(rideable_type,
paste(percent, "%"),
sep = "\n")),
position = position_stack(vjust = 0.5),
color = "white") +
labs(title = "Percentage of Casual and Member Rides by Bike Type Over the Last Year")+
theme_void()
## `summarise()` has grouped output by 'member_casual'. You can override using the `.groups` argument.
- The most popular bike type for both members (78%) and casuals (62%) was the classic bike. Members did not use any docked bikes.
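Note that these percentages are shares within each rider type, not shares of all rides: after `summarize()`, the result is still grouped by member_casual (as the message above says), so `sum(count)` is the total for that rider type. The sketch below shows the difference on a hypothetical five-ride data set, using base R's `prop.table()` as a stand-in for the dplyr pipeline.

```r
# Hypothetical rides: 3 casual, 2 member.
toy <- data.frame(
  member_casual = c("casual", "casual", "casual", "member", "member"),
  rideable_type = c("classic_bike", "classic_bike", "docked_bike",
                    "classic_bike", "electric_bike")
)

counts <- table(toy$member_casual, toy$rideable_type)

# Share within each rider type (each row sums to 100%).
within_group <- round(100 * prop.table(counts, margin = 1))

# Share of all rides (the whole table sums to 100%).
overall <- round(100 * prop.table(counts))

within_group["casual", "classic_bike"]  # 67
overall["casual", "classic_bike"]       # 40
```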
Let’s investigate the duration of bike rides.
#Let's make a similar plot, now for the average duration of bike rides.
#3. Although classic bikes are the most popular, docked bikes had the
#greatest average duration in the casual group. They were barely used in the
#members group.
data %>%
group_by(member_casual, rideable_type) %>%
summarize(dur = mean(duration), count = n()) %>%
ggplot(aes(x = rideable_type, y = dur,
color = member_casual,
fill = member_casual))+
geom_bar(stat = "identity") +
facet_wrap(~member_casual) +
labs (title = "Average Ride Duration Per Bike Type Per Member/Casual",
x = "Type of Bike", y = "Average Duration of Ride (min)") +
theme_minimal() +
theme(axis.text.x =
element_text(size = 10))
## `summarise()` has grouped output by 'member_casual'. You can override using the `.groups` argument.
## Don't know how to automatically pick scale for object of type difftime. Defaulting to continuous.
- Another insight: although classic bikes are the most popular, docked bikes had the greatest average ride duration in the casual group. They were barely used in the members group.
#Let's explore the number of rides per member/casual per month.
#4. Members took more rides than casuals using classic bikes. Members did
#not use docked bikes. In the summer (Jun to Aug), casuals took more rides than
#members. This was mostly driven by classic bikes in the casuals group,
# and partly by docked bikes in the casuals group, which were absent in the members
# group.
data %>%
group_by(month, member_casual) %>%
summarize(count = n()) %>%
ggplot +
aes(x=month, y=count, color = member_casual, group = member_casual) +
geom_point() + geom_line(size = 1) +
scale_y_continuous(limits = c(0, NA),
expand = expansion(mult = c(0, 0.1))) +
labs (title = "Number of Rides by Month by Casual/Member",
x = "Month", y = "Number of Rides")
## `summarise()` has grouped output by 'month'. You can override using the `.groups` argument.
- Members took more rides than casuals using classic bikes. Members did not use docked bikes. In the summer (Jun to Aug), casuals took more rides than members. Let us break this down further by bike type for more information.
data %>%
group_by(month, member_casual,rideable_type) %>%
summarize(count = n()) %>%
ggplot +
aes(x=month, y=count, color = member_casual, group = member_casual) +
geom_point() + geom_line(size = 1) +
scale_y_continuous(limits = c(0, NA),
expand = expansion(mult = c(0, 0.1))) +
labs (title = "Number of Rides by Month by Casual/Member by Bike Type",
x = "Month", y = "Number of Rides") +
facet_wrap(~rideable_type)
## `summarise()` has grouped output by 'month', 'member_casual'. You can override using the `.groups` argument.
- The summer peak in casual rides noted in #4 was mostly driven by classic bikes in the casuals group, and partly by docked bikes in the casuals group. Docked bikes were absent in the members group.
Let us look at the number of rides per member/casual per weekday.
#Let's explore the number of rides per member/casual per weekday.
#6. Members rode most often in the middle of the week, and were fairly consistent
# throughout the whole week. Casuals rode the most on weekends.
data %>%
group_by(wday, member_casual) %>%
summarize(count = n()) %>%
ggplot +
aes(x=wday, y=count, color = member_casual, group = member_casual) +
geom_point() + geom_line(size = 1) +
scale_y_continuous(limits = c(0, NA),
expand = expansion(mult = c(0, 0.1))) +
labs (title = "Number of Rides by Weekday by Casual/Member",
x = "Weekday", y = "Number of Rides")
## `summarise()` has grouped output by 'wday'. You can override using the `.groups` argument.
Let us break this data down by bike type.
data %>%
group_by(wday, member_casual,rideable_type) %>%
summarize(count = n()) %>%
ggplot +
aes(x=wday, y=count, color = member_casual, group = member_casual) +
geom_point() + geom_line(size = 1) +
scale_y_continuous(limits = c(0, NA),
expand = expansion(mult = c(0, 0.1))) +
labs (title = "Number of Rides by Weekday by Casual/Member by Bike Type",
x = "Weekday", y = "Number of Rides") +
facet_wrap(~rideable_type)
## `summarise()` has grouped output by 'wday', 'member_casual'. You can override using the `.groups` argument.
- Members rode most often in the middle of the week, and were fairly consistent throughout the whole week. Casuals rode the most on weekends.
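The weekday pattern can also be put into numbers using the counts from the weekday summary table in Step 3. The sketch below is simple arithmetic on those printed counts; the names `casual`, `member` and `weekend_share` are chosen here for illustration.

```r
# Ride counts per weekday, copied from the weekday summary table in Step 3.
casual <- c(Sun = 401446, Mon = 227583, Tue = 213663, Wed = 216898,
            Thu = 222924, Fri = 288407, Sat = 465725)
member <- c(Sun = 307771, Mon = 342941, Tue = 384495, Wed = 393914,
            Thu = 369888, Fri = 362108, Sat = 353215)

# Share of each group's rides that start on Saturday or Sunday.
weekend_share <- c(
  casual = sum(casual[c("Sat", "Sun")]) / sum(casual),
  member = sum(member[c("Sat", "Sun")]) / sum(member)
)
weekend_pct <- round(100 * weekend_share)

weekend_pct               # casual 43, member 26
names(which.max(casual))  # "Sat"
names(which.max(member))  # "Wed"
```

If rides were spread evenly over the week, the weekend share would be about 29% (2 days out of 7), so casuals are well above that level and members slightly below it.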
Let us look at the number of rides per member/casual per starting time hour of the day.
#Let's explore the number of rides per member/casual per hour of the day.
#7. Casuals and members took the most rides between 3pm and 8pm.
data %>%
group_by(hour, member_casual) %>%
summarize(count = n()) %>%
ggplot +
aes(x=hour, y=count, color = member_casual, group = member_casual) +
geom_point() + geom_line(size = 1) +
scale_y_continuous(limits = c(0, NA),
expand = expansion(mult = c(0, 0.1))) +
labs (title = "Starting time of Rides by Hour of the Day by Casual/Member",
x = "Hour of the day", y = "Number of Rides")
## `summarise()` has grouped output by 'hour'. You can override using the `.groups` argument.

Let us break this data down by bike type.
data %>%
group_by(hour, member_casual,rideable_type) %>%
summarize(count = n()) %>%
ggplot +
aes(x=hour, y=count, color = member_casual, group = member_casual) +
geom_point() + geom_line(size = 1) +
scale_y_continuous(limits = c(0, NA),
expand = expansion(mult = c(0, 0.1))) +
labs (title = "Starting Time of Rides by Hour of the Day by Casual/Member by Bike Type",
x = "Hour of the day", y = "Number of Rides") +
facet_wrap(~rideable_type)
## `summarise()` has grouped output by 'hour', 'member_casual'. You can override using the `.groups` argument.
- Casuals and members took the most rides between 3pm and 8pm.
Let’s look at the duration of the rides per member/casual per month.
# Same thing, but with duration
#Let's explore the duration of rides per member/casual per month.
#8. Casual rides were on average about 2x longer (~30min) than member rides (~15min) throughout the year.
# The longest rides for casuals were on docked bikes, and their
# average duration was greatest in February and July.
data %>%
group_by(month, member_casual) %>%
summarize(mean_duration = mean(duration)) %>%
ggplot +
aes(x=month, y=mean_duration, color = member_casual, group = member_casual) +
geom_point() + geom_line(size = 1) +
scale_y_continuous(limits = c(0, NA),
expand = expansion(mult = c(0, 0.1))) +
labs (title = "Duration of Rides by Month by Casual/Member",
x = "Month", y = "Duration of Rides (min)")
## `summarise()` has grouped output by 'month'. You can override using the `.groups` argument.
Let’s break this data down by bike type.
data %>%
group_by(month, member_casual,rideable_type) %>%
summarize(mean_duration = mean(duration)) %>%
ggplot +
aes(x=month, y=mean_duration, color = member_casual, group = member_casual) +
geom_point() + geom_line(size = 1) +
scale_y_continuous(limits = c(0, NA),
expand = expansion(mult = c(0, 0.1))) +
labs (title = "Duration of Rides by Month by Casual/Member by Bike Type",
x = "Month", y = "Duration of Rides (min)") +
facet_wrap(~rideable_type)
## `summarise()` has grouped output by 'month', 'member_casual'. You can override using the `.groups` argument.
- Casual rides were on average about 2x longer (~30min) than member rides (~15min) throughout the year. The longest rides for casuals were on docked bikes, and their average duration was greatest in February and July. The average duration of docked bike casual rides was ~80min, with a spike to ~155min in February.
Let’s explore the duration of rides per member/casual per weekday.
#Let's explore the duration of rides per member/casual per weekday.
#9. Members were consistent in their ride duration across the week, while casuals had the
#greatest duration on weekends. Duration of casual rides increased from ~30min to ~35min on weekends.
data %>%
group_by(wday, member_casual) %>%
summarize(mean_duration = mean(duration)) %>%
ggplot +
aes(x=wday, y=mean_duration, color = member_casual, group = member_casual) +
geom_point() + geom_line(size = 1) +
scale_y_continuous(limits = c(0, NA),
expand = expansion(mult = c(0, 0.1))) +
labs (title = "Duration of Rides by Weekday by Casual/Member",
x = "Weekday", y = "Duration of Rides (min)")
## `summarise()` has grouped output by 'wday'. You can override using the `.groups` argument.
Let’s break this data down by rideable type.
data %>%
group_by(wday, member_casual,rideable_type) %>%
summarize(mean_duration = mean(duration)) %>%
ggplot +
aes(x=wday, y=mean_duration, color = member_casual, group = member_casual) +
geom_point() + geom_line(size = 1) +
scale_y_continuous(limits = c(0, NA),
expand = expansion(mult = c(0, 0.1))) +
labs (title = "Duration of Rides by Weekday by Casual/Member by Bike Type",
x = "Weekday", y = "Duration of Rides (min)") +
facet_wrap(~rideable_type)
## `summarise()` has grouped output by 'wday', 'member_casual'. You can override using the `.groups` argument.
- Members were consistent in their ride duration across the week, while casuals had the greatest duration on weekends. Duration of casual rides increased from ~30min to ~35min on weekends. Docked bike casual rides averaged ~80min.
Let’s explore the duration of rides per member/casual per hour of the day.
#Let's explore the duration of rides per member/casual per hour of the day.
#10. Casuals took the longest rides starting at 2-4am, mostly driven by
# docked bike usage. Members were consistent in ride duration throughout
# all hours.
data %>%
group_by(hour, member_casual) %>%
summarize(mean_duration = mean(duration)) %>%
ggplot +
aes(x=hour, y=mean_duration, color = member_casual, group = member_casual) +
geom_point() + geom_line(size = 1) +
scale_y_continuous(limits = c(0, NA),
expand = expansion(mult = c(0, 0.1))) +
labs (title = "Duration of Rides by Starting Hour of the Day by Casual/Member",
x = "Hour of the day", y = "Duration of Rides (min)")
## `summarise()` has grouped output by 'hour'. You can override using the `.groups` argument.
Let’s break this down by bike type.
data %>%
group_by(hour, member_casual,rideable_type) %>%
summarize(mean_duration = mean(duration)) %>%
ggplot +
aes(x=hour, y=mean_duration, color = member_casual, group = member_casual) +
geom_point() + geom_line(size = 1) +
scale_y_continuous(limits = c(0, NA),
expand = expansion(mult = c(0, 0.1))) +
labs (title = "Duration of Rides by Starting Hour of the Day by Casual/Member by Bike Type",
x = "Hour of the day", y = "Duration of Rides (min)") +
facet_wrap(~rideable_type)
## `summarise()` has grouped output by 'hour', 'member_casual'. You can override using the `.groups` argument.
- Casuals took the longest rides starting at 2-4am, mostly driven by docked bike usage. Members were consistent in ride duration throughout all hours.
Let us now investigate the top 10 routes used by members, casuals and casuals that use docked bikes.
##Get the top 10 routes for members and casuals
casual_routes <- data %>%
filter(member_casual == "casual") %>%
group_by(route) %>%
tally(sort = TRUE)
casual_routes = casual_routes[1:10,]
casual_droutes <- data %>%
filter(member_casual == "casual") %>%
filter(rideable_type == "docked_bike") %>%
group_by(route) %>%
tally(sort = TRUE)
casual_droutes = casual_droutes[1:10,]
member_routes <- data %>%
filter(member_casual == "member") %>%
group_by(route) %>%
tally(sort = TRUE)
member_routes = member_routes[1:10,]
casual_routes %>%
ggplot(aes(x = reorder(route, n),
y = n)) +
geom_bar(stat = "identity",
fill = "blue") +
xlab("Most Frequently Used Top 10 Routes") +
ylab("Number of Rides") +
coord_flip() +
labs(title = "Top 10 Routes for Casual Riders") +
theme_minimal()
- The most popular route for casual riders was Streeter Dr & Grand Ave to Streeter Dr & Grand Ave. This seems to be a ride out to somewhere and then back to the same location.
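One way to follow up on the round-trip observation would be to measure how many rides start and end at the same station. The route column only stores the pasted names, so this would have to be computed before the station name columns are dropped. The sketch below uses a hypothetical four-ride data frame (the rows are made up, the station names are borrowed from the charts) to show the idea.

```r
# Hypothetical rides; station names borrowed from the top-route charts.
rides <- data.frame(
  start_station_name = c("Streeter Dr & Grand Ave", "Streeter Dr & Grand Ave",
                         "Ellis Ave & 60th St", "Ellis Ave & 55th St"),
  end_station_name   = c("Streeter Dr & Grand Ave", "Streeter Dr & Grand Ave",
                         "Ellis Ave & 55th St", "Ellis Ave & 60th St"),
  stringsAsFactors = FALSE
)

# A round trip starts and ends at the same station.
is_round_trip <- rides$start_station_name == rides$end_station_name

# Share of rides that are round trips.
mean(is_round_trip)  # 0.5 for this toy data
```

On the real data, the same comparison could be grouped by member_casual to test whether round trips are more typical of casual riders.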
casual_droutes %>%
ggplot(aes(x = reorder(route, n),
y = n)) +
geom_bar(stat = "identity",
fill = "blue") +
xlab("Most Frequently Used Top 10 Routes") +
ylab("Number of Rides") +
coord_flip() +
labs(title = "Top 10 Routes for Casual Docked Bike Riders") +
theme_minimal()
- The most popular route for docked bike casual riders was the same as for casual riders as a whole: Streeter Dr & Grand Ave to Streeter Dr & Grand Ave.
member_routes %>%
ggplot(aes(x = reorder(route, n),
y = n)) +
geom_bar(stat = "identity",
fill = "blue") +
xlab("Most Frequently Used Top 10 Routes") +
ylab("Number of Rides") +
coord_flip() +
labs(title = "Top 10 Routes for Member Riders") +
theme_minimal()
- The most popular route for member riders was Ellis Ave & 60th St to Ellis Ave & 55th St, followed by the reverse trip, Ellis Ave & 55th St to Ellis Ave & 60th St.
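The top-route tables used in this section follow a count, sort and take-the-first-N pattern. For readers who want to see that pattern in isolation, here is a minimal base-R sketch on a hypothetical seven-ride data set (the stations "A", "B" and "C" are made up), with `table()`, `sort()` and `head()` standing in for `tally(sort = TRUE)` and the `[1:10,]` subset.

```r
# Hypothetical rides between made-up stations A, B and C.
rides <- data.frame(
  start = c("A", "A", "B", "A", "C", "A", "A"),
  end   = c("A", "B", "A", "A", "C", "A", "B"),
  stringsAsFactors = FALSE
)

# Build the route label the same way as in the processing step.
rides$route <- paste(rides$start, rides$end, sep = " to ")

# Count rides per route, sort from most to least frequent, keep the top 2.
route_counts <- sort(table(rides$route), decreasing = TRUE)
top_routes <- head(route_counts, 2)

names(top_routes)  # "A to A" "A to B"
```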